 question-answer pair


e3a0db7c0a191854c176af1d20cdec80-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

The descriptions of each task are as follows:

Single-view tasks. Single-view tasks test a model's ability to infer spatial properties from a single image. These tasks include:
- Depth estimation (OC, OO, NA): Predicting absolute or relative depth values for objects.
- Distance prediction (OC, OO, NA): Estimating the Euclidean distance between objects or from an object to the camera.
- Object center distance inference (OO, MCA): Given objects A, B and C, determine which of B and C is farther or closer to A.
- Object spatial relation (OO, MCA): Determining relative positioning (e.g., left, right, in ...)
- Spatial imagination (OC, OO, MCA): Predicting unseen spatial relationships based on limited visual information.

Multi-view tasks. Multi-view tasks require reasoning across multiple images to infer spatial relationships. These tasks include:
- Viewpoint change inference (NA): Given two perspectives, output how the camera should be moved to see the second perspective.
- Multi-view distance prediction (OC, OO, NA): Estimating object distances across different views.
- Multi-view object matching (MCA): Identifying the same object across multiple views.
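As a rough illustration of the object center distance inference task described above, the following Python sketch picks which of two objects lies farther from a reference object using Euclidean distance. The 3D center coordinates and the function names `euclidean_distance` and `farther_from` are hypothetical; the snippet does not describe how the benchmark itself derives its ground truth.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(p, q)

def farther_from(anchor, candidates):
    """Return the name of the candidate whose center is farthest from the anchor.

    `candidates` maps an object name to its (x, y, z) center.
    """
    return max(candidates, key=lambda name: euclidean_distance(anchor, candidates[name]))

# Hypothetical object centers in camera coordinates (meters).
centers = {
    "A": (0.0, 0.0, 2.0),
    "B": (1.0, 0.0, 2.0),
    "C": (0.0, 3.0, 6.0),
}

answer = farther_from(centers["A"], {"B": centers["B"], "C": centers["C"]})
print(answer)
```

Swapping `max` for `min` would give the "closer" variant of the same multiple-choice question.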


Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Neural Information Processing Systems

Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a 10K-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.


Mellow: a small audio language model for reasoning

Neural Information Processing Systems

Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to the SoTA Qwen2 Audio (which scores 52.5), while using 50 times fewer parameters and being trained on 60 times less data (measured in audio hours).


Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration

Neural Information Processing Systems

While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domains for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, which fail to adjust dynamically to the target LLM's varying capabilities across domains and lead to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM's performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during the target LLM's updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at Bohdi.
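The abstract names Sliding Window Binomial Likelihood Ratio Testing but does not spell out the procedure. The Python sketch below shows one generic way such a test could look, and should not be read as Bohdi's implementation: it keeps a window of recent correct/incorrect outcomes, splits the window into an older and a newer half (an assumption made here), and flags a shift when a two-sample binomial likelihood ratio statistic exceeds 3.841, the 5% critical value of a chi-square distribution with one degree of freedom. The class and function names are hypothetical.

```python
import math
from collections import deque

def _binom_loglik(k, n, p):
    """Log-likelihood of k successes in n trials (binomial coefficient omitted)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def likelihood_ratio_stat(old, new):
    """2 * (log-lik with separate rates - log-lik with one pooled rate)."""
    k1, n1 = sum(old), len(old)
    k2, n2 = sum(new), len(new)
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    ll_alt = _binom_loglik(k1, n1, p1) + _binom_loglik(k2, n2, p2)
    ll_null = _binom_loglik(k1, n1, pooled) + _binom_loglik(k2, n2, pooled)
    return 2.0 * (ll_alt - ll_null)

class SlidingWindowShiftDetector:
    """Flags a change in success rate between the two halves of a sliding window."""

    def __init__(self, window=40, threshold=3.841):
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop out automatically
        self.threshold = threshold

    def update(self, correct):
        """Record one outcome; return True if a shift is detected."""
        self.outcomes.append(1 if correct else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait until the window is full
        data = list(self.outcomes)
        half = len(data) // 2
        return likelihood_ratio_stat(data[:half], data[half:]) > self.threshold

# Usage: ten failures followed by ten successes in a window of 20.
detector = SlidingWindowShiftDetector(window=20)
flags = [detector.update(c) for c in [False] * 10 + [True] * 10]
```

In a setting like the one described, a detected shift in a domain's success rate could be the signal to revise that domain's sampling proportion; how Bohdi actually uses the test is not stated in the abstract.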


Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Neural Information Processing Systems

AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, ...


MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Neural Information Processing Systems

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different).
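The triplet construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: images are plain nested lists of grayscale pixels, the augmentation (random horizontal flip plus a brightness shift) is a stand-in for whatever augmentations the paper uses, and `augment`, `build_triplet`, and `rule_based_reward` are hypothetical names.

```python
import random

def augment(image, rng):
    """Return an augmented copy: random horizontal flip plus a brightness shift."""
    if rng.random() < 0.5:
        rows = [row[::-1] for row in image]
    else:
        rows = [row[:] for row in image]
    shift = rng.randint(-10, 10)
    return [[min(255, max(0, px + shift)) for px in row] for row in rows]

def build_triplet(image, distractor, rng):
    """Build labeled comparison pairs from one image and a similar but distinct one.

    The labels come for free from the construction, so no manual
    question-answer annotation is needed.
    """
    view_a = augment(image, rng)
    view_b = augment(image, rng)
    return [
        (view_a, view_b, "same"),
        (view_a, distractor, "different"),
        (view_b, distractor, "different"),
    ]

def rule_based_reward(model_answer, label):
    """Reward 1.0 when the model's final answer matches the label, else 0.0."""
    return 1.0 if model_answer.strip().lower() == label else 0.0

rng = random.Random(0)
image = [[10, 20, 30], [40, 50, 60]]
distractor = [[10, 20, 30], [40, 50, 200]]  # similar but distinct image
pairs = build_triplet(image, distractor, rng)
labels = [label for _, _, label in pairs]
```

Because the reward depends only on the final same/different answer, it can be checked by a rule rather than by a learned reward model, which is what makes this kind of supervision compatible with rule-based reinforcement learning.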